fix(geo): tokenization-based keyword matching to prevent false positives #503

Merged: koala73 merged 6 commits into main from fix/geo-tagging-tokenization on Feb 28, 2026
koala73 (Owner) commented Feb 28, 2026

Summary

Fixes #324 — supersedes #330 by @princelevant (preserving authorship via --author)

  • Problem: String.includes() keyword matching caused false positives: "assad" matched inside "ambassador", "hts" inside "rights", "house" inside "warehouse", etc.
  • Solution: Tokenization-based matching (Set.has() for O(1) lookups) with sub-part decomposition for possessives/hyphens
  • Scope: 9 consumer files updated across geo-hub, DeckGLMap, Map, CII, story-data, related-assets, tech-hub, and data-loader
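
The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names `substringMatch` and `tokenMatch` are hypothetical, and the tokenizer here is deliberately simpler than whatever `src/utils/keyword-match.ts` actually does.

```typescript
// Hypothetical helpers for illustration; the PR's real code lives in
// src/utils/keyword-match.ts and is not reproduced here.

// Old behaviour: substring search, which also hits the middle of longer words.
function substringMatch(title: string, keyword: string): boolean {
  return title.toLowerCase().includes(keyword.toLowerCase());
}

// New behaviour (simplified): split into whole-word tokens, then do a Set lookup.
function tokenMatch(title: string, keyword: string): boolean {
  const tokens = new Set(
    title.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0),
  );
  return tokens.has(keyword.toLowerCase());
}

const ambassadorTitle = "French Ambassador outlines new strategy";
const rightsTitle = "Human rights groups condemn";

console.log(substringMatch(ambassadorTitle, "assad")); // true  (false positive)
console.log(tokenMatch(ambassadorTitle, "assad"));     // false
console.log(substringMatch(rightsTitle, "hts"));       // true  (false positive)
console.log(tokenMatch(rightsTitle, "hts"));           // false
```

Because the lookup is against whole tokens, a short keyword such as "hts" can stay in a keyword list without matching every word that happens to contain those letters.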

Key design decisions

  • Single source of truth: src/utils/keyword-match.ts with tokenizeForMatch(), matchKeyword(), matchesAnyKeyword(), findMatchingKeywords()
  • Sub-part decomposition: "Assad's" → Set contains both "assad's" and "assad", which fixes the possessive false negative from #330 (fix: use word-boundary regex for geo-tagging keyword matching)
  • Multi-word phrase matching: "white house" matches as contiguous ordered tokens, so "the house is painted white" does NOT match
  • DC keywords cleaned: removed 'house' (substring-matched inside "warehouse") and 'us ' (trailing-space hack)
  • Damascus 'hts' keyword kept — safe with tokenization since tokens.has('hts') won't match "rights"
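
A rough sketch of how sub-part decomposition and contiguous phrase matching could fit together is shown below. It is an assumption about the shape of the logic, not a copy of `keyword-match.ts`: the `TokenizedText` type, the `tokenize` and `matchKeywordSketch` names, and the tokenizing regex are all hypothetical.

```typescript
// Hypothetical sketch; names and regex are assumptions, not the PR's code.
interface TokenizedText {
  ordered: string[];   // tokens in reading order, used for phrase matching
  lookup: Set<string>; // tokens plus their sub-parts, used for O(1) single-word checks
}

function tokenize(text: string): TokenizedText {
  // Keep apostrophes and hyphens inside a token so "assad's" survives as one unit.
  const ordered: string[] =
    text.toLowerCase().match(/[a-z0-9]+(?:['’-][a-z0-9]+)*/g) ?? [];
  const lookup = new Set<string>(ordered);
  for (const token of ordered) {
    // Sub-part decomposition: "assad's" also contributes "assad" (and "s").
    for (const part of token.split(/['’-]/)) {
      if (part.length > 0) lookup.add(part);
    }
  }
  return { ordered, lookup };
}

function matchKeywordSketch(tokens: TokenizedText, keyword: string): boolean {
  const words = keyword.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  const first = words[0];
  if (first === undefined) return false;
  // Single word: plain Set lookup.
  if (words.length === 1) return tokens.lookup.has(first);
  // Multi-word: every keyword word must appear in order with no gaps.
  for (let start = 0; start + words.length <= tokens.ordered.length; start++) {
    if (words.every((word, offset) => tokens.ordered[start + offset] === word)) {
      return true;
    }
  }
  return false;
}

console.log(matchKeywordSketch(tokenize("Assad's forces advance in Syria"), "assad"));    // true
console.log(matchKeywordSketch(tokenize("White House announces budget"), "white house")); // true
console.log(matchKeywordSketch(tokenize("the house is painted white"), "white house"));   // false
```

Keeping both an ordered array and a Set is one way to get constant-time single-word lookups while still being able to check word order for phrases; the sketch compares phrase words against the ordered tokens exactly, so it would miss a possessive inside a phrase.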

Intentionally NOT changed

  • analysis-constants.ts (includesKeyword/containsTopicKeyword) — different pipeline, blast radius concern
  • entity-index.ts — uses regex for match position extraction
  • country-geometry.ts — already uses \b regex correctly
  • Server-side handlers — separate scope for follow-up
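
For context on why the regex-based files were left alone: a token Set answers "is this word present?" but discards offsets, whereas a `\b` regex can report where each match starts and ends. The sketch below illustrates that idea only; `findWordPositions` and `escapeRegExp` are hypothetical names and this is not the code in entity-index.ts or country-geometry.ts.

```typescript
// Illustrative only; not the implementation in entity-index.ts or country-geometry.ts.
interface MatchPosition {
  start: number; // index of the first matched character
  end: number;   // index one past the last matched character
}

// Escape regex metacharacters so a keyword is matched literally.
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function findWordPositions(text: string, keyword: string): MatchPosition[] {
  if (keyword.length === 0) return []; // avoid zero-length matches looping forever
  const pattern = new RegExp(`\\b${escapeRegExp(keyword)}\\b`, "gi");
  const positions: MatchPosition[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    positions.push({ start: match.index, end: match.index + match[0].length });
  }
  return positions;
}

// "Assad" is found as a whole word; the "assad" inside "ambassador" is skipped.
console.log(findWordPositions("Syria: Assad meets ambassador", "assad"));
// [ { start: 7, end: 12 } ]
```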

Test plan

  • npx tsc --noEmit passes
  • node --test tests/geo-keyword-matching.test.mjs — 44 tests pass
  • False positive: "French Ambassador outlines new strategy" does NOT match Damascus
  • True positive: "Assad's forces advance in Syria" DOES match Damascus
  • False positive: "Human rights groups condemn" does NOT match Damascus via 'hts'
  • Multi-word: "White House announces budget" matches DC, "Housing market crashes" does NOT

cc @princelevant — thank you for the original investigation and tokenization approach in #330!

Replace String.includes() with tokenization-based Set.has() matching
across the geo-tagging pipeline. Prevents false positives like "assad"
matching inside "ambassador" and "hts" matching inside "rights".

- Add src/utils/keyword-match.ts as single source of truth
- Decompose possessives/hyphens ("Assad's" → includes "assad")
- Support multi-word phrase matching ("white house" as contiguous)
- Remove false-positive-prone DC keywords ('house', 'us ')
- Update 9 consumer files across geo-hub, map, CII, and asset systems
- Add 44 tests covering false positives, true positives, edge cases

Co-authored-by: karim <mirakijka@gmail.com>
Fixes #324

Address code review feedback:

P1a: Add suffix-aware matching for plurals and demonyms so existing
keyword lists don't regress (houthi→houthis, ukraine→ukrainian,
iran→iranian, israel→israeli, russia→russian, taiwan→taiwanese).
Uses curated suffix list + e-dropping rule to avoid false positives.
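
A minimal sketch of suffix-aware matching with an e-dropping rule is given below.
The suffix list is an assumption chosen to cover the six examples above; the
curated list in the PR is not shown in this description. The function name
`matchesWithSuffix` is hypothetical, and restricting e-dropping to
vowel-initial suffixes is likewise an assumption.

```typescript
// Assumed suffix list for illustration only; the PR's curated list may differ.
const DEMONYM_SUFFIXES: readonly string[] = ["s", "n", "i", "ian", "ese"];

function matchesWithSuffix(token: string, keyword: string): boolean {
  if (token === keyword) return true;
  for (const suffix of DEMONYM_SUFFIXES) {
    // Plain suffixing: "houthi" + "s", "russia" + "n", "taiwan" + "ese".
    if (token === keyword + suffix) return true;
    // E-dropping (assumed to apply to vowel-initial suffixes only):
    // "ukraine" + "ian" is written "ukrainian", not "ukraineian".
    const startsWithVowel = /^[aeiou]/.test(suffix);
    if (startsWithVowel && keyword.endsWith("e") && token === keyword.slice(0, -1) + suffix) {
      return true;
    }
  }
  return false;
}

console.log(matchesWithSuffix("ukrainian", "ukraine")); // true
console.log(matchesWithSuffix("taiwanese", "taiwan"));  // true
console.log(matchesWithSuffix("ambassador", "assad"));  // false
```

Because the token must equal the keyword plus a listed suffix, this stays much
stricter than substring or prefix matching.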

P1b: Expand conflictTopics arrays in DeckGLMap and Map with demonym
forms so "Iranian senate..." correctly registers as conflict topic.

P2: Replace inline test functions with real module import via tsx.
Tests now exercise the production keyword-match.ts directly.

The .mts test file wasn't covered by `node --test tests/*.test.mjs`.
Add `npx tsx --test tests/*.test.mts` so test:data runs both suites.

- Use tsx as test runner for both .mjs and .mts (single invocation)
- Removes ; separator which breaks on Windows cmd.exe
- Add tsx to devDependencies so it works in offline/CI environments

- Add wordMatches() for suffix-aware phrase matching so "South Korean"
  matches keyword "south korea" and "North Korean" matches "north korea"
- Add MIN_SUFFIX_KEYWORD_LEN=4 guard so short keywords like "ai", "us",
  "hts" only do exact-match (prevents "ais"→"ai", "uses"→"us" false positives)
- Add 5 new tests covering both fixes (58 total, all passing)

Add compound suffixes (ians, eans, ans, ns, is) to handle plural
demonym forms like "Iranians"→"iran", "Ukrainians"→"ukraine",
"Russians"→"russia", "Israelis"→"israel". Adds 5 new tests (63 total).
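
The length guard, compound suffixes, and suffix-aware phrase matching described
in the last two commits could be combined roughly as follows. Only
MIN_SUFFIX_KEYWORD_LEN=4 and the compound suffixes (ians, eans, ans, ns, is)
come from the commit messages; the base suffixes, the signature of
`wordMatches`, and the `matchPhrase` helper are assumptions made for this
sketch.

```typescript
const MIN_SUFFIX_KEYWORD_LEN = 4; // value stated in the commit message

// Base suffixes are assumed; the compound ones are those named in the commit message.
const SUFFIX_LIST: readonly string[] = [
  "s", "es", "n", "i", "ian", "ese",  // assumed base list
  "ians", "eans", "ans", "ns", "is",  // compound suffixes from the commit message
];

// Hypothetical signature: does one text token match one keyword word?
function wordMatches(token: string, keywordWord: string): boolean {
  if (token === keywordWord) return true;
  // Short keywords ("ai", "us", "hts") only ever match exactly.
  if (keywordWord.length < MIN_SUFFIX_KEYWORD_LEN) return false;
  // E-dropping stem, e.g. "ukraine" -> "ukrain".
  const stem = keywordWord.endsWith("e") ? keywordWord.slice(0, -1) : null;
  return SUFFIX_LIST.some(
    (suffix) =>
      token === keywordWord + suffix || (stem !== null && token === stem + suffix),
  );
}

// Hypothetical helper: contiguous, ordered phrase match using wordMatches per word.
function matchPhrase(tokens: string[], keyword: string): boolean {
  const words = keyword.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return false;
  for (let start = 0; start + words.length <= tokens.length; start++) {
    if (words.every((word, i) => wordMatches(tokens[start + i] ?? "", word))) {
      return true;
    }
  }
  return false;
}

console.log(wordMatches("uses", "us"));                                        // false (guarded)
console.log(wordMatches("ukrainians", "ukraine"));                             // true
console.log(matchPhrase(["south", "korean", "officials"], "south korea"));     // true
```

Without the length guard, "us" + "es" would accept "uses"; with it, two- and
three-letter keywords fall back to exact matching.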

Development

Successfully merging this pull request may close these issues.

Geo-tagging uses substring matching, causing articles to be placed in wrong map regions

2 participants